ABSTRACT

The “Collective Intelligence” (COIN) framework concerns 
the design of collectives of agents so that as those agents 
strive to maximize their individual utility functions, their 
interaction causes a provided “world” utility function con- 
cerning the entire collective to be also maximized. Here 
we show how to extend that framework to scenarios hav- 
ing Markovian dynamics when no re-evolution of the sys- 
tem from counter-factual initial conditions (an often expen- 
sive calculation) is permitted. Our approach transforms 
the (time-extended) argument of each agent’s utility func- 
tion before evaluating that function. This transformation 
has benefits in scenarios not involving Markovian dynam- 
ics, in particular scenarios where not all of the arguments 
of an agent’s utility function are observable. We investigate 
this transformation in simulations involving both linear and 
quadratic (nonlinear) dynamics. In addition, we find that 
a certain subset of these transformations, which result in 
utilities that have low “opacity” (analogous to having high 
signal to noise) but are not “factored” (analogous to not 
being incentive compatible), reliably improve performance 
over that arising with factored utilities. We also present a 
Taylor Series method for the fully general nonlinear case. 

1. INTRODUCTION 
1.1 Background 

In this paper we are concerned with large distributed col- 
lectives of interacting goal-driven computational processes, 
where there is a provided ‘world utility’ function that rates 
the possible behaviors of that collective [29, 27]. We are 
particularly concerned with such collectives where the indi- 
vidual computational processes use machine learning tech- 
niques (e.g., Reinforcement Learning (RL) [14, 20, 19, 23]) 
to try to achieve their individual goals. We represent those 
goals of the individual processes as maximizing an associated 
‘payoff’ utility function, one that in general can differ 
from the world utility. 

In such a system, we are confronted with the following 
inverse problem: How should one initialize/update the payoff 
utility functions of the individual processes so that the 
ensuing behavior of the entire collective achieves large values 
of the provided world utility? In particular, since in 
truly large systems detailed modeling of the system is usually 
impossible, how can we avoid such modeling? Can we 
instead leverage the simple assumption that our learning 
algorithms are individually fairly good at what they do to 
achieve a large world utility value? 

This problem is related to work in many other fields, including 
multi-agent systems (MAS’s), computational economics, 
mechanism design, reinforcement learning, statistical 
mechanics, computational ecologies, (partially observable) 
Markov decision processes and game theory. However, 
none of these fields is both applicable in large problems and 
directly addresses the general inverse problem, rather than 
a special instance of it. (See [27] for a detailed discussion of 
the relationship between these fields, involving hundreds of 
references.) For example, the field of mechanism design is 
not generally applicable, being largely tailored to collectives 
of human beings, and in particular to the idiosyncrasy of 
such collectives that their members have hidden variables 
whose values they “do not want to reveal”. There is other 
previous work that does consider the general inverse problem, 
and even has each individual computational process 
(or “agent”) use reinforcement learning [2, 7, 10, 15, 16]. 
However, in that work in general each process has the world 
utility function as its payoff utility function (i.e., implements 
a “team game” or an “exact potential game” [8]). Unfortunately, 
as expounded below and in previous work, this 
approach scales extremely poorly to large problems. (Intuitively, 
the difficulty is that each agent can have a hard 
time discerning the echo of its behavior on the world utility 
when the system is large; each agent has a horrible 
“signal-to-noise” problem.) 

Intuitively, we are concerned with payoff utility functions 
that are “aligned” with the world utility, in that modifications 
a player might make that would improve its payoff 
utility also must improve world utility (footnote 1). Fortunately the 
equivalence class of such payoff utilities extends well beyond 
team-game utilities. In particular, in previous work 
we used the Collective INtelligence (COIN) framework to 
derive the ‘Wonderful Life Utility’ (WLU) payoff function 
[27] as an alternative to a team-game payoff utility. The 
WLU is aligned with world utility, as desired. In addition 
though, WLU overcomes much of the signal-to-noise problem 
of team game utilities [22, 29, 27, 31]. 

Footnote 1: Such alignment can be viewed as an extension of the 
concept of incentive compatibility in mechanism design [9] 

As an example, in some of our previous work we used 
the WLU for distributed control of network packet rout- 
ing [29]. Conventional approaches to packet routing have 
each router run a shortest path algorithm (SPA). Unlike 
with a WLU-based collective, in SPA-based routing there 
is no concern for deleterious side-effects of routing decisions 
on the global goal (e.g., no concern for bottlenecks). We 
ran simulations demonstrating that a WLU-based collective 
has substantially better throughputs than the best possible 
SPA-based system [29], even though that SPA-based system 
had information denied the COIN system. 

As another example, in [30] we considered the pared-down 
problem domain of a congestion game, in particular a more 
challenging variant of Arthur’s El Farol bar attendance problem 
[1], sometimes also known as the “minority game” [6]. 
In this problem the individual processes making up the collective 
are explicitly viewed as ‘players’ involved in a 
non-cooperative game. Each player has to determine which night 
in the week to attend a bar. The problem is set up so that if 
either too few people attend (boring evening) or too many 
people attend (crowded evening), the total enjoyment of the 
attending players drops. Our goal is to design the payoff 
functions of the players so that the total enjoyment across 
all nights is maximized. In this previous work we showed 
that use of the WLU can result in performance orders of 
magnitude superior to that of team game utilities. 

1.2 The Contribution of This Paper 

In this paper we extend this previous work with an ap- 
proach based on Transforming Arguments Utility functions 
(TAU) before the evaluation of those functions. The TAU 
process was originally designed to be applied to the indi- 
vidual utility functions of the agents in systems in which 
the world utility depends on the final state in an episode 
of variables outside the collective that undergo Markovian 
dynamics, with the update rule of those variables reflecting 
the state of the agents at the beginning of the episode. This 
is a very common scenario, obtaining whenever the agents in 
the collective act as control signals perturbing the evolution 
of a Markovian system. 

In the previous version of the COIN framework, to achieve 
good signal-to-noise for such scenarios might require re-evolving 
the system from counter-factual initial states of the agents to 
evaluate each agent’s reward for a particular episode. This 
can be computationally expensive. With TAU utility func- 
tions no such re-evolving is needed; the observed history of 
the system in the episode is transformed in a relatively cheap 
calculation, and then the utility function is evaluated with 
that transformed history rather than the actual one. 

The TAU process has other advantages that apply even in 
scenarios not involving Markovian dynamics. In particular 
it allows us to employ the COIN framework even when not 
all arguments of the original utility function are observable, 
due for example to communication limitations. In addition, 
certain types of TAU transformations result in utility functions 
that are not exactly aligned with the world utility, 
but have so much better signal-to-noise that the collective 
performs better when agents use those transformed utility 
functions than it does with exactly aligned utility functions. 

In this paper computational experiments based on linear 
and quadratic (nonlinear) update rules for the Markovian 
system are presented that verify the foregoing. In particu- 
lar in these experiments, we consider systems of 50 agents 
using a variety of world utilities and Markovian update rules. 
We compare the performance of using TAU utilities for the 
agents for linear and quadratic dynamics versus the per- 
formance using the corresponding team game utilities. We 
also investigate systems having limited observability. In 
these cases, the performance with TAU utilities robustly 
outperforms even that of team game utilities in which 
there is full observability. We also find that the non-aligned, 
high signal-to-noise utilities consistently outperform their 
factored counterparts. We end with results using a Taylor 
Series method to address the more general nonlinear case 
than the quadratic one investigated here. 


2. THE MATHEMATICS OF COLLECTIVE 
INTELLIGENCE 

We view the individual agents in the collective as players 
involved in a repeated game (footnote 2). Let $Z$ with elements $\zeta$ be the 
space of possible joint moves of all players in the collective 
in some stage. We wish to search for the $\zeta$ that maximizes 
a provided world utility $G(\zeta)$. In addition to $G$ we are 
concerned with utility functions $\{g_\eta\}$, one such function for 
each variable/player $\eta$. We use the notation $\hat\eta$ to refer to all 
players other than $\eta$. 


2.1 Intelligence and the central equation 

We wish to “standardize” utility functions so that the 
numeric value they assign to a $\zeta$ only reflects their ranking 
of $\zeta$ relative to certain other elements of $Z$. We call such 
a standardization of an arbitrary utility $U$ for player $\eta$ the 
“intelligence for $\eta$ at $\zeta$ with respect to $U$”. Here we will 
use intelligences that are equivalent to percentiles: 

$$\epsilon_U(\zeta : \eta) \equiv \int d\mu_{\zeta_{\hat\eta}}(\zeta')\, \Theta[U(\zeta) - U(\zeta')] \qquad (1)$$

where the Heaviside function $\Theta$ is defined to equal 1 when 
its argument is greater than or equal to 0, and to equal 
0 otherwise, and where the subscript on the (normalized) 
measure $d\mu$ indicates it is restricted to $\zeta'$ sharing the same 
non-$\eta$ components as $\zeta$. In general, the measure must reflect 
the type of system at hand, e.g., whether $Z$ is countable 
or not, and if not, what coordinate system is being used. 
Other than that, any convenient choice of measure may be 
used and the theorems will still hold. Intelligence values are 
always between 0 and 1. 
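
The percentile definition in Equation 1 can be illustrated for a player with a finite set of moves. The following is a minimal sketch, assuming a uniform measure over the player's alternative moves; the function name `intelligence` and the toy utility are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def intelligence(utility, joint_move, player, moves):
    """Percentile intelligence of `player` at `joint_move` with respect to `utility`.

    Assumes a uniform (normalized) measure over the player's alternative moves;
    the moves of all other players are held fixed.
    """
    actual = utility(joint_move)
    count = 0
    for m in moves:
        alternative = list(joint_move)
        alternative[player] = m               # vary only this player's move
        if actual >= utility(tuple(alternative)):   # Heaviside: 1 when argument >= 0
            count += 1
    return Fraction(count, len(moves))

def toy_utility(z):
    # Hypothetical utility: the sum of all moves.
    return sum(z)

moves = (-1, 0, 1)
best = intelligence(toy_utility, (1, 0), 0, moves)
worst = intelligence(toy_utility, (-1, 0), 0, moves)
average = sum(intelligence(toy_utility, (m, 0), 0, moves) for m in moves) / 3
print(best, worst, average)
```

With three distinct moves, the best move scores 1, the worst scores 1/3, and a uniformly random choice averages 2/3.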

Our uncertainty concerning the behavior of the system is 
reflected in a probability distribution over $Z$. Our ability 
to control the system consists of setting the value of some 
characteristic of the collective, e.g., setting the utility functions of 
the players. Indicating that value by $s$, our analysis revolves 
around the following central equation for $P(G \mid s)$, which 
follows from Bayes’ theorem: 

$$P(G \mid s) = \int d\vec\epsilon_G\, P(G \mid \vec\epsilon_G, s) \int d\vec\epsilon_g\, P(\vec\epsilon_G \mid \vec\epsilon_g, s)\, P(\vec\epsilon_g \mid s), \qquad (2)$$

where $\vec\epsilon_g \equiv (\epsilon_{g_{\eta_1}}(\zeta : \eta_1), \epsilon_{g_{\eta_2}}(\zeta : \eta_2), \ldots)$ is the vector of the 
intelligences of the players with respect to their associated 
functions, and $\vec\epsilon_G \equiv (\epsilon_G(\zeta : \eta_1), \epsilon_G(\zeta : \eta_2), \ldots)$ is the vector 
of the intelligences of the players with respect to $G$. 

Footnote 1 (continued): non-human agents, off-equilibrium behavior, etc. 

Footnote 2: The full mathematics of the COIN framework, however, 
extends significantly beyond what is needed to address such 
games. See [28]. 

Note that $\epsilon_{g_\eta}(\zeta : \eta) = 1$ means that player $\eta$ is fully 
rational at $\zeta$, in that its move maximizes its utility, given 
the moves of the other players. In other words, a point $\zeta$ where 
$\epsilon_{g_\eta}(\zeta : \eta) = 1$ for all players $\eta$ is one that meets the 
definition of a game-theory Nash equilibrium [9]. Note that 
consideration of points $\zeta$ at which not all intelligences equal 
1 provides the basis for a model-independent formalization 
of bounded rationality game theory, a formalization that 
contains variants of many of the theorems of conventional 
full-rationality game theory [25]. On the other hand, a $\zeta$ at 
which all components of $\vec\epsilon_G = 1$ is a local maximum of $G$ 
(or more precisely, a critical point of the $G(\zeta)$ surface). 

If we can choose $s$ so that the third conditional probability 
in the integrand is peaked around vectors $\vec\epsilon_g$ all of whose 
components are close to 1, then we have likely induced large 
intelligences. If in addition the second term is peaked about 
$\vec\epsilon_G$ equal to $\vec\epsilon_g$, then $\vec\epsilon_G$ will also be large. Finally, if the 
first term is peaked about high $G$ when $\vec\epsilon_G$ is large, then our 
choice of $s$ will likely result in high $G$, as desired. 

Intuitively, the requirement that the utility functions have 
high “signal-to-noise” (an issue not considered in conventional 
work in mechanism design) arises in the third term. 
It is in the second term that the requirement that the utility 
functions be “aligned with $G$” arises. In this work we 
concentrate on these two terms, and show how to simulta- 
neously set them to have the desired form. 

Details of the stochastic environment in which the collective 
operates, together with details of the learning algorithms 
of the players, are reflected in the distribution $P(\zeta)$ 
which underlies the distributions appearing in Equation 2. 
Note though that independent of these considerations, our 
desired form for the second term in Equation 2 is assured 
if we have chosen utilities such that $\vec\epsilon_g$ equals $\vec\epsilon_G$ exactly 
for all $\zeta$. We call such a system factored. In game-theory 
language, the Nash equilibria of a factored collective 
are local maxima of $G$. In addition to this desirable equilibrium 
behavior, factored collectives automatically provide 
appropriate off-equilibrium incentives to the players (an issue 
rarely considered in game theory / mechanism design). 


2.2 Opacity 

We now focus on algorithms based on utility functions 
$\{g_\eta\}$ that optimize the signal/noise ratio reflected in the 
third term, subject to the requirement that the system be 
factored. To understand how these algorithms work, given 
a measure $d\mu(\zeta_\eta)$, define the opacity at $\zeta$ of utility $U$ as: 

$$\Omega_U(\zeta : \eta, s) \equiv \int d\zeta'\, J(\zeta' \mid \zeta)\, \frac{|U(\zeta) - U(\zeta'_{\hat\eta}, \zeta_\eta)|}{|U(\zeta) - U(\zeta_{\hat\eta}, \zeta'_\eta)|} \qquad (3)$$

where $J$ is defined in terms of the underlying probability 
distributions (footnote 3), and $(\zeta'_{\hat\eta}, \zeta_\eta)$ is defined as the worldline whose 
$\hat\eta$ components are the same as those of $\zeta'$ while its $\eta$ components 
are the same as those of $\zeta$ ([28]). 

Footnote 3: Writing it out in full, $J(\zeta' \mid \zeta) = \ldots$ [expression illegible] 

The denominator absolute value in the integrand in Equation 3 
reflects how sensitive $U(\zeta)$ is to changing $\zeta_\eta$. In contrast, 
the numerator absolute value reflects how sensitive 
$U(\zeta)$ is to changing $\zeta_{\hat\eta}$. So the smaller the opacity of a utility 
function $g_\eta$, the more $g_\eta(\zeta)$ depends only on the move of 
player $\eta$, i.e., the better the associated signal-to-noise ratio 
for $\eta$. Intuitively then, lower opacity should mean it is easier 
for $\eta$ to achieve a large value of its intelligence. 

To formally establish this, we use the same measure $d\mu$ 
to define opacity as the one that defined intelligence. Under 
this choice expected opacity bounds how close to 1 expected 
intelligence can be [28]: 

$$E(\epsilon_U(\zeta : \eta) \mid s) \le 1 - K, \quad \text{where} \quad K \le E(\Omega_U(\zeta : \eta, s) \mid s). \qquad (5)$$

So low expected opacity of utility $g_\eta$ ensures that a necessary 
condition is met for the third term in Equation 2 to have the 
desired form for player $\eta$. While low opacity is not, formally 
speaking, also sufficient for $E(\epsilon_U(\zeta : \eta) \mid s)$ to be close to 1, 
in practice the bounds in Equation 5 are usually tight. 

2.3 Difference Utilities 

It is possible to solve for the set of all utilities that are 
factored with respect to a particular world utility. Unfortunately, 
in general it is not possible for a collective both to 
be factored and to have zero opacity for all of its players. 
However, consider difference utilities, which are of the form 

$$U(\zeta) = G(\zeta) - \Gamma(f(\zeta)) \qquad (6)$$

where $\Gamma(f)$ is independent of $\zeta_\eta$. Any difference utility is 
factored [26], and under benign approximations, $E(\Omega_U \mid s)$ 
is minimized over the set of such utilities by choosing 

$$\Gamma(f(\zeta)) = E(G \mid \zeta_{\hat\eta}, s), \qquad (7)$$

up to an overall additive constant. We call the resultant 
difference utility the Aristocrat utility (AU), loosely reflecting 
the fact that it measures the difference between a 
player’s actual action and the average action. 
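
Equations 6 and 7 can be made concrete for a player with a finite move set. The sketch below is illustrative only: it assumes the probabilities of the player's own moves are known, and it uses a small spin-glass-style world utility picked for convenience. The names `spin_glass` and `aristocrat_utility` are hypothetical.

```python
def spin_glass(J, z):
    """Illustrative world utility: sum over i < j of J[i][j] * z[i] * z[j]."""
    n = len(z)
    return sum(J[i][j] * z[i] * z[j] for i in range(n) for j in range(i + 1, n))

def aristocrat_utility(G, z, player, moves, probs):
    """AU = G(z) - E[G | moves of the other players].

    The expectation runs over the player's own moves, weighted by `probs`,
    while every other component of z is held fixed.
    """
    expected = 0.0
    for m, p in zip(moves, probs):
        alternative = list(z)
        alternative[player] = m
        expected += p * G(alternative)
    return G(z) - expected

J = [[0, 1, -2],
     [0, 0, 3],
     [0, 0, 0]]
G = lambda z: spin_glass(J, z)
moves, probs = (-1, 1), (0.5, 0.5)

z = [1, -1, 1]
z_alt = [-1, -1, 1]            # same joint move, except player 0 switches
au = aristocrat_utility(G, z, 0, moves, probs)
au_alt = aristocrat_utility(G, z_alt, 0, moves, probs)
print(au, au_alt, G(z) - G(z_alt))
```

Because the subtracted expectation does not depend on the player's own move, a change in that move alters AU and $G$ by the same amount, which is the sense in which a difference utility is factored.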

If possible, we would like each player $\eta$ to use the associated 
AU as its utility function to ensure good form for both 
terms 2 and 3 in Equation 2. This is not always feasible, however. 
The problem is that to evaluate the expectation value 
defining its AU each player needs to evaluate the current 
probabilities of each of its potential moves. However, if the 
player then changes its utility function to be the associated 
AU it will in general substantially change its ensuing behavior. 
(The player now wants to choose moves that maximize 
a different function from the one it was maximizing before.) 
In other words, it will change the probabilities of its moves, 
which means that its new utility function is in fact not the 
AU for its actual (new) probabilities. 

There are ways around this self-consistency problem, but 
in practice it is often easier to bypass the entire issue, by 
giving each $\eta$ a utility function that does not depend on the 
probabilities of $\eta$’s own moves. One such utility function is 
the Wonderful Life Utility (WLU). The WLU for player $\eta$ 
is parameterized by a pre-fixed clamping parameter $CL_\eta$ 
chosen from among $\eta$’s possible moves: 

$$WLU_\eta = G(\zeta) - G(\zeta_{\hat\eta}, CL_\eta). \qquad (8)$$

WLU is factored regardless of the choice of clamping parameter. 
Furthermore, while not matching AU’s low opacity, 
WLU usually has far better opacity than does a team game. 

Footnote 3 (continued): with: [expression illegible] 
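
As a minimal sketch of Equation 8, the following computes a WLU by clamping one player's move and checks that switching only that player's move changes WLU and $G$ by the same amount. The spin-glass-style world utility, the clamp value 0, and the names `spin_glass` and `wlu` are assumptions made for illustration.

```python
import random

def spin_glass(J, z):
    """Illustrative world utility: sum over i < j of J[i][j] * z[i] * z[j]."""
    n = len(z)
    return sum(J[i][j] * z[i] * z[j] for i in range(n) for j in range(i + 1, n))

def wlu(G, z, player, clamp_value):
    """Wonderful Life Utility: G(z) minus G with the player's move clamped."""
    clamped = list(z)
    clamped[player] = clamp_value
    return G(z) - G(clamped)

random.seed(1)
n = 6
J = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
G = lambda z: spin_glass(J, z)

z = [random.choice((-1, 1)) for _ in range(n)]
flipped = list(z)
flipped[0] = -z[0]             # only player 0 changes its move

delta_G = G(z) - G(flipped)
delta_wlu = wlu(G, z, 0, 0) - wlu(G, flipped, 0, 0)
print(delta_G, delta_wlu)
```

The clamped term is identical for both joint moves, so it cancels in the difference; any other clamp value would cancel the same way.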

3. THE COIN FRAMEWORK FOR SYSTEMS 
WITH MARKOVIAN EVOLUTION 

We consider games which consist of multi-step “episodes”. 
Within each episode the entire system evolves in a Markovian 
manner from the initial moves of the players. We are 
interested in such games where some of the players $\eta$ are 
not agents whose initial state is under control of a learning 
algorithm that we control, but rather constitute an “environment” 
for those controllable agents (i.e., where some of 
the players correspond to the state of nature). 

Let $A$ be the Markovian single step evolution operator of 
the entire system through an episode, 

$$\zeta_t = A \zeta_{t-1}$$

Each component $\zeta_\eta$, for example, could be a one-dimensional 
real number. The row vector $A_\eta$ would then be $\eta$’s update 
rule. Alternatively, each agent could be represented by one 
of $N$ symbolic values. In that case, $\zeta_\eta$ would be given in a 
unary representation as a vector in $\mathbb{R}^N$ (i.e. a Haar basis). 
Considering such large spaces is necessary to describe 
arbitrary, nonlinear dynamics as Markovian evolution. Here 
we will concentrate on the former case, where the moves of 
the players are all real numbers. 

The full multiple time step evolution of an episode is given 
by the single step operator in the usual way: Let 

$$C \equiv \begin{pmatrix} 1 \\ A \\ \vdots \\ A^{T} \end{pmatrix} \qquad (9)$$

where $T$ is the number of time steps per episode. This operator 
applied to our initial state $\zeta_0$ yields the entire “worldline”, 
or time history, of the system, 

$$\zeta = C \zeta_0. \qquad (10)$$
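
The worldline produced by repeated application of the single step operator can be sketched as follows. This is illustrative only: the time-ordered stacking of states and the helper names `mat_vec` and `worldline` are assumptions, not taken from the paper.

```python
def mat_vec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def worldline(A, z0, T):
    """Return the history [z_0, z_1, ..., z_T] with z_t = A z_{t-1}."""
    history = [list(z0)]
    for _ in range(T):
        history.append(mat_vec(A, history[-1]))
    return history

A = [[1, 1],
     [0, 1]]
print(worldline(A, [1, 2], 2))
```

The returned list plays the role of the stacked worldline: entry `t` is the system state after `t` applications of `A`.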

We consider difference utility functions of the form 

$$g_\eta(\zeta) = G(C\zeta_0) - \Gamma_\eta(F_\eta C \zeta_0) \qquad (11)$$

where $G$ is the world utility function to be optimized. We 
will choose $F_\eta$ so that the product $F_\eta C \zeta_0$ is independent of 
agent $\eta$’s actions. This is a necessary and sufficient condition 
for the associated difference utility $g_\eta(\zeta)$ to be factored with 
respect to the world utility $G$ for any and all choices of 
$\Gamma_\eta$. In general, $\Gamma_\eta$ can be chosen in such a way to optimize 
learnability. Here though, for simplicity, we choose $\Gamma_\eta = G$. 
Accordingly, application of the $F_\eta$ operator is an instance 
of transforming the argument of the (second term of the) 
utility functions of the agents, i.e., it is a TAU process. 


It is important to note that the particular form of $C$ given 
above is not necessary for the results and methods of this 
paper to apply. In fact, there is no reason even to view the 
COIN-based choice of the $g_\eta$ as optimizing $G$ for a multi-step 
game involving a “dynamics” process in some sense; it 
can be viewed as simply optimizing some $G(C(\zeta_0))$ for some 
“abstract” function $C$. As we will see, a major advantage of 
our approach to optimizing functions of this form is that $C$ 
only needs to be run once to set the values of $g_\eta$. Moreover, 
this single running of $C$ is automatically done “by nature” as 
the system runs. There is no extra burden on the individual 
agents to perform calculations involving $C$, for example, to 
evaluate counter-factual moves. 


4. LINEAR EVOLUTION 

4.1 Avoiding re-evolution of the system 

We now consider the operator $F_\eta$ for the case of linear evolution 
(i.e. $C$ is a linear operator). For simplicity episodes 
are composed of one time step (i.e. $C = A$), and agents initially 
exist in one of two states (i.e., the players under our 
control can make one of two moves). As mentioned above, a 
sufficient condition for $\eta$’s difference utility $g_\eta$ to be factored 
is that the combination $F_\eta C \zeta_0$ is independent of $\eta$’s initial 
action. One way to accomplish this starts by clamping $\eta$’s 
initial action, producing $CL_\eta \zeta_0$, where $CL_\eta$ is a clamping 
operator represented by a decimated identity matrix with 
zero-valued diagonal element at position $\eta$. This clamped 
state must then be re-evolved to produce the desired combination, 
$C(CL_\eta \zeta_0)$. 

Unfortunately doing this means re-evolving the entire system, 
which may be computationally prohibitive, especially if 
it must be done for each agent. We define, therefore, a 
post-evolution clamping operator $F_\eta$ such that $F_\eta C \zeta_0 = C(CL_\eta \zeta_0)$, 
and therefore no re-evolving is needed once $C \zeta_0$ has been 
evaluated (by nature). It follows that 

$$F_\eta = C\,(CL_\eta)\,C^{-1} \qquad (12)$$


The spectral structure of the operator $F_\eta$ is readily determined. 
The eigenvalues are $\lambda_k = 1 - \delta_{k,\eta}$, where $\delta_{k,\eta}$ is the 
Kronecker delta. Corresponding eigenvectors are $e_k = c_k$, 
where $\{c_k\}$ are the columns of the linear evolution operator 
$C$. Since they span the space, the post-evolved state can be 
expanded in terms of these eigenvectors of $F_\eta$: 

$$\zeta_t = \sum_k a_k c_k. \qquad (13)$$

Application of $F_\eta$ to the post-evolved state in this basis is 
straightforward. The result is $F_\eta \zeta_t = \zeta_t - a_\eta c_\eta$, where $a_\eta$ is 
the projection of $\zeta_t$ in the direction of $c_\eta$. Furthermore, since 
eigenvectors of $F_\eta$ correspond to columns of $C$, the matrix 
$C^{-1}$ acts as a projector onto this basis. Using this fact and 
recalling that $\zeta_t = C \zeta_0$, it can be shown that $a_\eta = \zeta_{0,\eta}$ 
equals agent $\eta$’s action at $t = 0$. Thus, $F_\eta$ can be completely 
expressed in terms of observed post-evolution quantities: 

$$F_\eta \zeta_t = \zeta_t - \zeta_{0,\eta}\, c_\eta. \qquad (14)$$

In this way we can calculate the result of clamping the initial 
state and re-evolving without performing that re-evolution. 
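
Equation 14 can be checked numerically. The sketch below, written under the single-time-step assumption $C = A$ with a randomly drawn matrix, compares the post-evolution shortcut against explicitly clamping the initial state and re-evolving. The helper names (`mat_vec`, `clamp`, `post_evolution_clamp`) are hypothetical.

```python
import random

def mat_vec(C, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(c * x for c, x in zip(row, v)) for row in C]

def clamp(z0, player):
    """CL_eta: zero out the player's entry of the initial state."""
    out = list(z0)
    out[player] = 0.0
    return out

def post_evolution_clamp(zt, C, z0_player, player):
    """F_eta zeta_t = zeta_t - zeta_{0,eta} c_eta, with c_eta the eta-th column of C."""
    return [zt[k] - z0_player * C[k][player] for k in range(len(zt))]

random.seed(0)
n = 5
C = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(n)]
z0 = [random.choice((-1.0, 1.0)) for _ in range(n)]

zt = mat_vec(C, z0)                          # evolved once, "by nature"
player = 2
shortcut = post_evolution_clamp(zt, C, z0[player], player)
reevolved = mat_vec(C, clamp(z0, player))    # the expensive route, for comparison only

max_gap = max(abs(a - b) for a, b in zip(shortcut, reevolved))
print(max_gap)
```

The shortcut needs only the observed final state, the agent's own initial action, and one column of $C$, so its cost is linear in the state size instead of a full matrix-vector product per agent.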

Figure 1: Performance for 50 agents with linear dynamics 
when the environment is set to zero at the 
beginning of each episode. Results for TAU $g$ are 
represented by +, results for 75% observability TAU, 
$g^{75\%}$, are □, then applying $L$ to the first as well as 
second terms gives the utility $g^{75\%}_{nf}$ with results depicted 
as *. $g^{25\%}$ is ×, $g^{25\%}_{nf}$ is △, and finally, $G$, the 
team game, is ○. Error bars are too small to see. 

Figure 2: System performance for 50 agents with 
linear dynamics in a random environment. Key is 
same as Figure 1. The small degradation in performance 
is due to randomness from the environment. 

4.2 Observability restrictions 

In practice, the full worldline of the system may not be 
fully observable to each agent. Such limited observability 
of a particular component may be determined by the prob- 
lem. In other cases, due to communication constraints each 
agent is only allowed to observe a certain number of compo- 
nents, and must select which such components to observe, 
for example to optimize some auxiliary quantity like opacity. 
Similarly, the dynamics may not be known exactly to 
the agent; some rows of $C$ may be uncertain to an agent, 
or simply cannot be determined. In these kinds of situations 
the $g_\eta$ described above cannot be evaluated at the end 
of an episode by agent $\eta$, even if the value $G(\zeta_t)$ is globally 
broadcast to all agents. 

The TAU approach outlined above is well-suited to address 
such situations. Formally, a decimated identity operator 
$L$ can be defined whose diagonal elements are $\{0, 1\}$ 
depending on whether or not the corresponding components 
are observable. The corresponding factored utility for agent $\eta$ is 

$$g_\eta(\zeta_t) = G(\zeta_t) - G(L F_\eta \zeta_t), \qquad (15)$$

where in general $L$ may vary with $\eta$. Given global broadcast 
to all agents of the value of $G(\zeta_t)$, for each agent to evaluate 
this type of $g_\eta$ only requires that those components of $F_\eta \zeta_t$ 
that are non-zero (and therefore can vary) after application 
of the $L$ operator be observed. 

Figure 3: Comparison of average noise for factored 
$g^{75\%}$ (upper graph) and nonfactored $g^{75\%}_{nf}$ (lower 
graph) utility functions with 75% observability. The 
first 100 time steps are the training period. 

This difference utility has two main sources of noise, one 
from potentially poor choice of the clamping operator, and 
the other from the use of L in the second (subtracted) term 
but not in the first. To address that latter source of noise we 
can impose limited observability on the first term in addition 
to the second one, getting 

$$g_\eta(\zeta_t) = G(L\zeta_t) - G(L F_\eta \zeta_t). \qquad (16)$$

The new utility is not factored with respect to G. Ac- 
cording to the central equation however, it may still result 
in better performance than when we don’t have L in the first 
term, if the improvement in opacity more than offsets the 
loss of exact factoredness. In addition to the potential for 
such far superior opacity, this utility has the added advan- 
tage that now we don’t even need to rely on global broadcast 
of $G(\zeta_t)$ to evaluate $g_\eta$. 
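
The two limited-observability utilities (Equations 15 and 16) differ only in whether the observation mask is applied to the first term. A minimal sketch follows, assuming a small spin-glass-style world utility and representing $L$ as a 0/1 list; the names `apply_mask` and `tau_utilities` and all numerical values are hypothetical.

```python
def spin_glass(J, z):
    """Illustrative world utility: sum over i < j of J[i][j] * z[i] * z[j]."""
    n = len(z)
    return sum(J[i][j] * z[i] * z[j] for i in range(n) for j in range(i + 1, n))

def apply_mask(mask, z):
    """Decimated identity L: keep observable components, zero out the rest."""
    return [m * x for m, x in zip(mask, z)]

def tau_utilities(G, zt, F_zt, mask):
    """Return (factored, nonfactored) utilities for one agent.

    factored    = G(zt)   - G(L F zt)    (mask in the second term only)
    nonfactored = G(L zt) - G(L F zt)    (mask in both terms)
    """
    second = G(apply_mask(mask, F_zt))
    return G(zt) - second, G(apply_mask(mask, zt)) - second

J = [[0, 1, 2],
     [0, 0, 1],
     [0, 0, 0]]
G = lambda z: spin_glass(J, z)

zt = [1.0, 2.0, 3.0]               # observed post-evolution state
F_zt = [0.5, 1.5, 2.5]             # post-evolution clamped state for this agent
mask = [1, 1, 0]                   # last component is not observable

factored, nonfactored = tau_utilities(G, zt, F_zt, mask)
full_f, full_nf = tau_utilities(G, zt, F_zt, [1, 1, 1])
print(factored, nonfactored, full_f, full_nf)
```

With full observability the two utilities coincide; they separate only once some components are masked, and only the factored form requires the broadcast value $G(\zeta_t)$.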

4.3 Experiments 

Numerical simulations were performed with 50 agents. Af- 
ter an initial 100-episode training period, agents selected 
initial actions in each subsequent episode with the same re- 
inforcement learning algorithm used in our previous work. 
All players underwent linear dynamics within each episode. 
The world utility function was a spin glass, 

$$G(\zeta_t) = \sum_{i<j} J_{ij}\, \zeta_{t,i}\, \zeta_{t,j} \qquad (17)$$

We collected statistics by averaging runs over many randomly 
set matrices $A$ and coupling constants $J_{ij}$. These 
runs were for systems whose first 25% and 75% components 
at the end of the episode are observable, given some canon- 
ical ordering of agents. We considered both the case where 
the environment was initialized to zero (Figure 1) and where 
it was initialized randomly (Figure 2). We examined world 
utility value vs. episode number for six utility functions. 

1) TAU $g$ for a fully observable system; 

2) TAU $g$ for 75% observability, $g^{75\%}$; 

3) The modification $g^{75\%}_{nf}$ giving a non-factored system, 
again with 75% observability; 

4) $g^{25\%}$ for a factored system with 25% observability; 

5) $g^{25\%}_{nf}$ for a non-factored system with 25% observability; 

6) The team game, where every $g_\eta = G$. 

Even the results for limited observability clearly outperform 
the corresponding team game in which there is full 
observability. Furthermore, for 75% observability, the 
nonfactored utilities ($L$ in both terms) consistently outperform 
their factored counterpart. In these runs factoredness fell to 
approximately 90%, while noise levels in the utility functions 
were as shown in figures 3 and 4. The improvement in per- 
formance due to better signal-to-noise more than outweighs 
the degradation due to loss in factoredness. 


5. NONLINEAR EVOLUTION 

Generalizing these results to arbitrary nonlinear dynamics 
requires high dimensional representations. In particular, in 
the case where all agents’ states are binary, the number of 
joint states grows as $2^N$ where $N$ is the number of agents. 
The successive bits in such a representation can be indicated 
as $\{x_i\} \in B = \{-1, 1\}$ where we have $N$ bits altogether. 
Alternatively, we can expand the joint state in the basis of 
Walsh functions $\{x_i, x_i x_j, \ldots\}$, which span the set 
of all functions taking elements of the space $B$ to $B$. 

Doing this reduces the original nonlinear dynamics to linear 
dynamics, at the price of expanding the size of the space. 
As an example, in the case of a quadratic update rule, we 
can represent $\zeta_0$ in terms of second order Walsh functions 
$\{x_i x_j\}$. Evolution of the system is accomplished by application 
of the associated evolution matrix $C$ or $A$, yielding 
$\zeta_t = C \zeta_0$. To obtain factored utility functions, analogous 
post-evolution operators $F_\eta$ can be constructed. To ensure 
that the second term in the difference utility is independent 
of $\eta$, all terms involving $x_\eta$ will have to be subtracted. In 
the quadratic case, $N$ such terms will have to be subtracted 
whereas in the linear case there was only one term. We find 

$$F_\eta \zeta_t = \zeta_t - \sum_i (\zeta_0)_{i\eta}\, c_{i\eta} \qquad (18)$$

where $c_{i\eta}$ is the column of $C$ corresponding to the Walsh 
function $x_i x_\eta$, $i = 1, \ldots, N$, and $(\zeta_0)_{i\eta}$ is the corresponding 
component of $\zeta_0$. Results of experiments for this 
case with 50 agents are presented in Figure 5. 
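
The quadratic case can be sketched by lifting the state to the products $x_i x_j$ and subtracting every column that involves the clamped agent. This is an illustration under stated assumptions: unordered pairs $i \le j$ index the lifted basis, clamping sets $x_\eta$ to 0, and the helper names are hypothetical.

```python
import random

def pair_index(n):
    """Unordered pairs (i, j) with i <= j, indexing the products x_i * x_j."""
    return [(i, j) for i in range(n) for j in range(i, n)]

def features(x, pairs):
    """Lift the state x to the vector of second-order products."""
    return [x[i] * x[j] for i, j in pairs]

def evolve(C, phi):
    """Linear evolution in the lifted space."""
    return [sum(c * f for c, f in zip(row, phi)) for row in C]

def post_evolution_clamp(zt, C, phi0, pairs, player):
    """Subtract every column of C whose product x_i * x_j involves `player`."""
    out = list(zt)
    removed = 0
    for col, (i, j) in enumerate(pairs):
        if player in (i, j):
            removed += 1
            for k in range(len(out)):
                out[k] -= phi0[col] * C[k][col]
    return out, removed

random.seed(0)
n = 4
pairs = pair_index(n)
C = [[random.uniform(-1, 1) for _ in pairs] for _ in range(n)]
x0 = [random.choice((-1.0, 1.0)) for _ in range(n)]

phi0 = features(x0, pairs)
zt = evolve(C, phi0)                         # evolved once, "by nature"
player = 1
shortcut, removed = post_evolution_clamp(zt, C, phi0, pairs, player)

x_clamped = list(x0)
x_clamped[player] = 0.0                      # explicit clamp, for comparison only
reevolved = evolve(C, features(x_clamped, pairs))
max_gap = max(abs(a - b) for a, b in zip(shortcut, reevolved))
print(removed, max_gap)
```

Exactly $N$ columns are subtracted for each agent, compared with a single column in the linear case.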

6. TAYLOR SERIES METHOD 

To address the more general nonlinear problem, we consider 
a slightly different framework. In this case, each agent 
is assigned a real-valued number $r_\eta$. The state of the system 
$\zeta_t$ is a vector with these numbers as components. Each 
agent can choose among three actions, which result in $r_\eta$ being 
modified by $\{\pm\Delta, 0\}$. Nonlinear update rules $\zeta_t = c(\zeta_0)$ 
are functions of these real-valued variables. 

Construction of factored utilities 

$$g_\eta(\zeta_t) = G(c(\zeta_0)) - G(c(CL\,\zeta_0)) \qquad (19)$$

requires that $c(CL\,\zeta_0)$ be independent of $\eta$’s choice of action. 
One way to accomplish this is to clamp (apply $CL$ to) $\zeta_0$ and 
re-evolve the system. To avoid re-evolving the system, we 
approximate $c(CL\,\zeta_0)$ with a Taylor Series expansion about 
the unclamped initial state $\zeta_0$: 

$$c(CL\,\zeta_0) = c(\zeta_0) + \Delta\,(\zeta_0 - CL\,\zeta_0) \cdot \nabla c(\zeta_0) \qquad (20)$$

Varying $\Delta$ provides us a small parameter to control the expansion. 
It should be noted that while this method requires 
that $c(\zeta)$ be differentiable, the world utility $G$ need not be. 
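
The idea of replacing re-evolution by a first-order expansion can be sketched for a scalar quadratic update rule. The code uses the textbook form $c(x + h) \approx c(x) + h \cdot \nabla c(x)$, with $h$ the displacement from the unclamped to the clamped state; the sign and scaling conventions, the choice that clamping means "action 0", and all names and numbers are assumptions of this sketch and may differ from the paper's.

```python
def quadratic(a, z):
    """c(z) = sum over i, j of a[i][j] * z[i] * z[j]."""
    n = len(z)
    return sum(a[i][j] * z[i] * z[j] for i in range(n) for j in range(n))

def gradient(a, z):
    """Analytic gradient of the quadratic form: dc/dz_k = sum_j (a[k][j] + a[j][k]) * z[j]."""
    n = len(z)
    return [sum((a[k][j] + a[j][k]) * z[j] for j in range(n)) for k in range(n)]

def taylor_clamped(a, z0, player, action, delta):
    """First-order estimate of c at the clamped state, using only values at z0.

    Clamping replaces the player's chosen action by 0, which displaces its
    component by h = -delta * action.
    """
    h = -delta * action
    return quadratic(a, z0) + h * gradient(a, z0)[player]

a = [[0.5, -1.0],
     [2.0, 0.3]]
r = [0.4, -0.7]                     # positions before the agents act
delta, action, player = 0.01, 1, 0

z0 = list(r)
z0[player] += delta * action        # unclamped initial state
exact = quadratic(a, r)             # explicit clamp and re-evaluation
approx = taylor_clamped(a, z0, player, action, delta)
print(exact - approx)
```

For a quadratic rule the residual is exactly the second-order term `a[player][player] * delta**2`, which shows how a smaller step size tightens the approximation.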

Figure 6 presents results for a quadratic update rule with 
randomly generated coefficients, $c(\zeta_0) = \sum_{i,j} a_{ij}\, \zeta_{0,i}\, \zeta_{0,j}$. The 
agents are given a random initial starting point with each 
coordinate between $-1$ and $1$. Because $c$ is quadratic, $G(\zeta_t)$ is a quartic polynomial 
in $N$ dimensions. Since the coefficients have random 
signs, the function $G$ has as many increasing directions as 
decreasing directions. The goal of the system is to traverse 
this high dimensional surface, find an increasing direction, 
and then follow that direction to infinity. 

In light of the central equation we plot the average intelligences 
of the agents. For three possible actions, the best 
action has an intelligence of 1 while the worst choice gives 
0.33. A random walk (no learning) gives a value of 0.67 on 
average. We find that a team game has the same intelligence 
as a random walk. The TAU utility $g$ displays a much higher 
intelligence, which is also reflected in better performance. 

It is interesting to adjust the ratio of $\pm$ signs in the coefficients 
of the polynomials. If we introduce, for example, 
more negative coefficients than positive, we expect the surface 
to preferentially turn down. The task for the agents 
becomes more challenging. We find, in fact, that three of 
the limited observability utilities perform worse over time 
(i.e., their world utility decreases). The team game also performs 
worse over time. In fact, not only does the team game 
give poor performance, but it fails altogether. The 
TAU utilities $g$ and $g_{nf}$ still give robust performance. 

Figure 6: System performance for $N = 50$ agents 
using the Taylor Series method. The dynamics is 
governed by a quadratic function of the agents’ “positions”. 
The world utility $G$ is a quartic in $N$ dimensions. 
(Upper two graphs are $g$ and [illegible]; middle 
two are [illegible] and $g^{75\%}$; lower two are $g^{25\%}$ and a team 
game $G$.) The initial training period is not shown. 

Figure 7: Percentile intelligence for agents using 
TAU $g$ (upper graph) versus a team game (lower 
graph). For three actions, a random walk (no learning) 
would give an average intelligence of 67%. 

Figure 8: Taylor Series method where the quadratic 
coefficients have more − than + signs. (Graphs: upper 
pair are $g$ and [illegible]; middle three are [illegible] 
and the team game; lower is [illegible].) In this case, 
three of the limited observability utilities and the 
team game perform worse over time (i.e., their world 
utilities decrease). 

Figure 9: Percentile intelligence for agents using 
TAU $g_\eta$ (upper graph) versus a team game (lower 
graph), when the surface preferentially turns down. 
The degradation in intelligence as compared to Figure 7 
reflects the greater difficulty of the problem. 


To further study this dramatic difference in performance, 
we compared the average intelligence of the agents for g and 
the team game. The results are shown in Figure 9. In the 
case of the team game, again, there is no appreciable change 
in intelligence from the initial training period to when the 
agents are invoking learning algorithms. Conversely, for the 
g utility, the agents perform at a higher intelligence than 
the team game albeit lower than the situation in Figure 7. 

7. CONCLUSION 

We present a detailed extension of the COIN framework 
to systems that undergo Markovian evolution. We find consistent, 
robust improvement of performance as compared to 
the corresponding team game. The approach is applied 
to systems with linear and quadratic (nonlinear) update 
rules. Results from numerical simulations are presented. 
This framework also naturally includes the case of limited 
observability. We found that even COIN-based utility func- 
tions constrained by limited observability often outperformed 
conventional team game utilities having full observability. 
We also found a new class of nonfactored utilities that consistently 
outperformed their factored counterpart, due to 
improved signal-to-noise characteristics. 

To address the general nonlinear case, we developed a 
Taylor Series method. In this case, the system of agents can 
be imagined to traverse an $N$-dimensional surface. We find 
that the system’s performance can depend on the characteristics 
of the surface being optimized. We show that in some 
situations a team game will fail altogether (i.e., its performance 
will degrade over time) while the corresponding TAU 
utility continues to perform well. 